1. INTRODUCTION

Toxic speech has recently attracted considerable attention because of the way it affects people. For
example, a study by Tirell described how toxic speech divided people into specific groups and
contributed to the death of over 800,000 people in 1994 [1]. Nowadays, freedom of expression has made
it easier to use toxic speech, to the extent that any outpouring of thought, especially on social
media, often leads to toxic speech [2]. In fact, interaction on social media, both verbal and
non-verbal, has become part of the lifestyle of many groups, and these exchanges are often
interspersed with toxic speech [3], [4]. Social media platforms not only improve communication but
also allow internet users to express their thoughts, which are quickly shared with the rest of the
world. Furthermore, given the users' varied backgrounds, beliefs, ethnicities, and cultures on these
platforms, many of them use derogatory, aggressive, and hostile language when conversing with those
who do not share their background [5], [6]. The amount of hate speech and toxic content on the
internet has steadily increased. Since the terms "profane," "hate," and "offensive" are often used
interchangeably, they are grouped here as "toxic" [7], [8].

Meanwhile, toxic speech can be classified using acoustic features. Because not all attributes are
relevant to the classification problem, choosing a suitable subset of features helps reduce the
computational complexity. To achieve this, attribute selection and dimension reduction techniques are
often used [9], [10]. The general stages of the feature selection process are shown in Figure 1 [11].

Past studies on feature selection have addressed feature optimization, dimension efficiency, and the
elimination of redundant features [12]. They also aimed to improve the accuracy of classification with
support vector machines (SVM) [13], [14]. Applications include feature selection using a Gaussian
mixture model [13], automatic speech assessment [15], heart sound features in the frequency and time
domains [16], music genre classification [17], [18], and audio-visual recognition [19], [20].


Figure 1. Stages of the feature selection process (diagram labels recovered: Original, Stopping,
Subset, Validation)


There are three main reasons for reducing dimensions: minimizing the learning cost, improving model
performance, and eliminating similar or redundant dimensions [21]. The reduction is carried out
because not all features are useful for classification, and irrelevant features, known as noise, often
reduce accuracy. Feature selection algorithms perform this function by searching the space of possible
feature subsets and evaluating the quality of each subset. Furthermore, a sequential forward selection
strategy can be used to speed up this process. According to Ververidis and Kotropoulos [22], a simple
sequential search strategy produces results quickly, with the number of subset evaluations given in
(1), where F_s is the cardinality of the selected feature set and F is the size of the original
feature set.


$O\big(F + (F-1) + (F-2) + \cdots + (F - F_s + 1)\big)$ (1)
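
As a worked instance of (1), the following sketch counts how many candidate subsets a sequential
forward search scores. The function name is illustrative and not part of the original study; the
values 72 and 10 are the feature counts reported in this paper.

```python
def sfs_evaluations(total_features: int, selected_features: int) -> int:
    """Count the candidate subsets scored by sequential forward selection.

    At step k (0-based) the search scores each of the (total_features - k)
    features that have not been selected yet.
    """
    if not 0 <= selected_features <= total_features:
        raise ValueError("selected_features must lie between 0 and total_features")
    return sum(total_features - k for k in range(selected_features))


# 72 extracted features, 10 of which are kept
print(sfs_evaluations(72, 10))  # 72 + 71 + ... + 63 = 675
```

For comparison, an exhaustive search over all subsets of 72 features would need on the order of 2^72
evaluations, which is why the sequential strategy is considered fast.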


Based on this introduction, toxic speech classification and feature selection are the main problems
addressed in this research. This study proposes a solution that combines both: feature selection for
toxic speech classification. We used a wrapper method with a forward feature selection technique on
2,000 data samples with 72 features. The selected features were then used with SVM and random forest
(RF) algorithms to classify toxic speech. We also compared the RF and SVM results obtained with the
selected features against those obtained with the original feature set.

This paper is organized as follows. Section 1 introduces toxic speech and how the number of speech
features can be deliberately reduced while retaining the best classification result. Section 2
describes the methods used for data preparation, speech feature extraction, feature selection, and
classification. Section 3 presents the results of all experimental scenarios, while Section 4 draws
conclusions from our results.


2. METHOD 

We propose a sequential search strategy that uses a wrapper method with a forward selection
algorithm. The selected features are then used to classify toxic speech using SVM and RF. As shown in
Figure 2, the stages of the experiment are data retrieval, pre-processing of both training and test
data, feature extraction, feature selection, and finally classification and evaluation.


Figure 2. Experimental flow design (diagram stages: raw voice data, pre-processing, feature
extraction, feature selection, toxic speech classification, classification result and
cross-validation)


2.1. Dataset 

The data were collected from YouTube using the search keywords "Online Debt" and "Online Fraud,"
selecting videos that contain conversations between two people. In total, 79 recordings were
obtained, with durations ranging from 7 to 54 minutes. Some of these videos contain toxic speech with
relatively low-noise audio; hence, the audio track was extracted and converted into WAV format.


2.2. Data preparation 

In this stage, speech segments were selected manually by listening to the recordings, and each segment
was also manually labeled as toxic or non-toxic. After selection, we obtained about 2,000 speech
samples, comprising 1,273 toxic and 763 non-toxic samples. According to the Oxford dictionary [19]
and Webster [20], toxic speech is a sentence containing intimidation, hate speech, or curses that
affect the other person psychologically and emotionally. Moreover, the database used was obtained from
Schofield et al. [23]. Eight main features were extracted from the 2,000 sound clips using openSMILE
with the INTERSPEECH 2010 Challenge parameterization [24]. The feature set comprises energy,
intensity, loudness, jitter local, JitterDDP, shimmer, harmonics-to-noise ratio (HNR), and
zero-crossing rate (ZCR), each summarized by nine statistical sub-features: maximum, minimum, range,
maximum position, minimum position, mean, standard deviation, skewness, and kurtosis. This yields
8 x 9 = 72 features in total.
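
To illustrate how one frame-level contour is reduced to the nine statistical sub-features listed
above, the following is a minimal sketch in plain Python. It is not the openSMILE implementation:
the function name, the use of population moments, and the choice to report positions as frame
indices are assumptions made for illustration.

```python
import math


def functionals(contour):
    """Return nine statistical sub-features of one frame-level contour."""
    n = len(contour)
    mean = sum(contour) / n
    variance = sum((v - mean) ** 2 for v in contour) / n  # population variance
    std = math.sqrt(variance)
    # Standardized third and fourth moments; set to 0.0 for a flat contour
    skew = sum((v - mean) ** 3 for v in contour) / (n * std ** 3) if std else 0.0
    kurt = sum((v - mean) ** 4 for v in contour) / (n * std ** 4) if std else 0.0
    highest, lowest = max(contour), min(contour)
    return {
        "maximum": highest,
        "minimum": lowest,
        "range": highest - lowest,
        "maximum_position": contour.index(highest),  # index of first maximum
        "minimum_position": contour.index(lowest),   # index of first minimum
        "mean": mean,
        "standard_deviation": std,
        "skewness": skew,
        "kurtosis": kurt,
    }


example = functionals([1.0, 2.0, 3.0, 4.0])
# Eight contours with nine sub-features each give the 72 features of this study
n_total_features = 8 * len(example)
```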

Several studies have shown that speech features yield significant results [25]. For example, energy
has been used to detect voiced and unvoiced speech, and voiced speech was found to have a greater
energy value than unvoiced speech [26]. Similarly, sound wave intensity affects the classification of
toxic sentences. By definition, the intensity I of a sound wave at a surface is the average rate per
unit area at which energy is transferred through or onto the surface. This intensity is formulated in
(2), where P is the rate of energy transfer (power) of the sound wave, and A is the area of the
surface that intercepts the sound.


$I = \frac{P}{A}$ (2)

$\beta = 10 \log_{10} \frac{I}{I_0}$ (3)


Additionally, intensity can be expressed as a sound level β, as shown in (3), measured in decibels
(dB), where I_0 is the standard reference intensity of 1.0 x 10^-12 W/m^2 [27]. Another feature
employed to classify toxic sentences is loudness, which is a measure of volume. This feature is used
because angry people tend to raise their voice, sometimes accompanied by curse words. However, not
every loud voice contains toxic speech.
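
As a small numerical check of the intensity and sound-level definitions, the sketch below reads (2)
as power per unit area and (3) as ten times the base-10 logarithm of the intensity ratio. The
function names and the example values are illustrative assumptions.

```python
import math

REFERENCE_INTENSITY = 1.0e-12  # I_0 in W/m^2


def intensity(power_watts, area_m2):
    """Intensity as energy transfer rate (power) per unit area."""
    return power_watts / area_m2


def sound_level_db(intensity_w_m2, reference=REFERENCE_INTENSITY):
    """Sound level in decibels relative to the reference intensity."""
    if intensity_w_m2 <= 0 or reference <= 0:
        raise ValueError("intensities must be positive")
    return 10.0 * math.log10(intensity_w_m2 / reference)


# An intensity equal to the reference corresponds to 0 dB, and every factor
# of ten in intensity adds 10 dB.
quiet = sound_level_db(1.0e-12)
louder = sound_level_db(1.0e-6)
```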

Jitter local, also known as relative jitter, was also used. It describes the variation of the
fundamental period from one cycle to the next, relative to the average period. Its value increases in
the presence of irregular glottal vibrations [28]. Another feature used is JitterDDP, the average
absolute difference between successive periods divided by the average period [28], shown in (4),
where T_i is the i-th fundamental period, A_i in (5) is the amplitude of the i-th period, and N is
the number of periods.


$\mathrm{JitterDDP} = \frac{\frac{1}{N-1} \sum_{i=1}^{N-1} |T_i - T_{i+1}|}{\frac{1}{N} \sum_{i=1}^{N} T_i}$ (4)

$\mathrm{Shimmer} = \frac{1}{N-1} \sum_{i=1}^{N-1} \left| 20 \log_{10} \frac{A_{i+1}}{A_i} \right|$ (5)


The sixth feature is shimmer local, shown in (5), which captures small cycle-to-cycle variations in
amplitude. It expresses the relative average perturbation of the amplitude per period. Its value
rises in cases of organic and functional vocal pathology [28].
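
The two perturbation measures can be sketched as follows, assuming the fundamental periods and the
per-period amplitudes have already been measured. The code follows the verbal definitions given
above (average absolute difference between successive periods over the average period, and average
absolute amplitude change in dB); the function names are illustrative, and the definitions of jitter
variants used by specific tools may differ.

```python
import math


def period_perturbation(periods):
    """Mean absolute difference of successive periods over the mean period."""
    n = len(periods)
    if n < 2:
        raise ValueError("at least two periods are required")
    mean_abs_diff = sum(abs(periods[i] - periods[i + 1]) for i in range(n - 1)) / (n - 1)
    return mean_abs_diff / (sum(periods) / n)


def shimmer_db(amplitudes):
    """Mean absolute amplitude change between successive periods, in dB."""
    n = len(amplitudes)
    if n < 2:
        raise ValueError("at least two amplitudes are required")
    return sum(
        abs(20.0 * math.log10(amplitudes[i + 1] / amplitudes[i])) for i in range(n - 1)
    ) / (n - 1)


# A perfectly regular voice has zero perturbation
steady = period_perturbation([0.01, 0.01, 0.01])
# Alternating long and short periods give a large relative perturbation
irregular = period_perturbation([1.0, 2.0, 1.0])
```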

The seventh feature is the harmonics-to-noise ratio (HNR), which measures the ratio between the
periodic and non-periodic components of a voiced sound segment [29]. The periodic component arises
from the vibration of the vocal cords, whereas the non-periodic component arises from glottal noise;
the ratio is usually expressed in dB. Basically, the greater the flow of air from the lungs into the
vocal cords, the greater the HNR. A low HNR indicates an asthenic voice and dysphonia. HNR is
considered pathological when the value is less than 7 dB, as described in Boersma's study [30], and
it is formulated in (6).

Although (6) is not stated directly in the frequency domain, it produces more accurate results. It is
estimated from the autocorrelation $r_x(\tau) = \int x(t)\,x(t + \tau)\,dt$, normalized so that
r_x(0) = 1, where x(t) is the time signal, τ is the time lag, and τ_max is the lag of the highest
autocorrelation peak away from zero. For perfectly periodic sounds, the HNR is unbounded [30].


$\mathrm{HNR\ (in\ dB)} = 10 \log_{10} \frac{r_x(\tau_{\max})}{1 - r_x(\tau_{\max})}$ (6)
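
A minimal sketch of the HNR formula in (6) is given below. It assumes the normalized autocorrelation
peak has already been estimated and lies strictly between 0 and 1; the function name is illustrative.

```python
import math


def hnr_db(r_max):
    """HNR in dB from the normalized autocorrelation peak r_max (0 < r_max < 1)."""
    if not 0.0 < r_max < 1.0:
        raise ValueError("r_max must lie strictly between 0 and 1")
    return 10.0 * math.log10(r_max / (1.0 - r_max))


# Equal periodic and noise energy (r_max = 0.5) corresponds to 0 dB
balanced = hnr_db(0.5)
# A nearly periodic segment gives a large HNR, which grows without bound as r_max approaches 1
nearly_periodic = hnr_db(0.99)
```

Under this formula, the 7 dB threshold mentioned above corresponds to a normalized autocorrelation
peak of roughly 0.83.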


The zero-crossing rate (ZCR) is the last feature; it indicates the number of times the amplitude
crosses zero within the signal frame. Unvoiced speech often has a higher zero-crossing rate than
voiced speech [26]. A zero crossing is detected when consecutive samples of the digital signal have
different signs. The ZCR is formulated in (7), where func is a function that returns 0 when its
argument is negative and 1 when it is positive, Z_w is the zero-crossing value of frame w, N is the
total number of samples in the frame, and x[m] is the amplitude of the m-th sample [31].


$Z_w = \frac{1}{2} \sum_{m=1}^{N} \left| \mathrm{func}(x[m]) - \mathrm{func}(x[m-1]) \right|$ (7)
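
The zero-crossing idea can be sketched as a count of sign changes between consecutive samples. This
sketch returns the raw count for one frame and applies no scaling factor; treating a sample equal to
zero as non-positive is an assumption, since the text only defines func for negative and positive
values.

```python
def func(value):
    """Indicator from the text: 1 for a positive sample, 0 otherwise."""
    return 1 if value > 0 else 0


def zero_crossings(frame):
    """Count sign changes between consecutive samples of one frame."""
    return sum(abs(func(frame[m]) - func(frame[m - 1])) for m in range(1, len(frame)))


# A rapidly alternating (noise-like) frame crosses zero at every step,
# whereas a frame that stays positive never does.
noisy = zero_crossings([1, -1, 1, -1])
smooth = zero_crossings([1, 2, 3, 2])
```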


2.3. Feature selection 

There are two main approaches to feature selection: the wrapper approach and the filter approach.
Both select a subset of features before the data mining algorithm is run; they differ in the
evaluation stage. The wrapper approach evaluates candidate subsets, searched in sequence, using the
target learning algorithm itself, whereas the filter approach uses an evaluation technique that is
separate from the data mining algorithm [32].

Therefore, we adopted a wrapper method with a forward selection technique, which uses a simple search
algorithm based on a linear regression model to reduce the dimensions of the dataset by eliminating
redundant and irrelevant attributes [33]. The forward selection framework originally described by
Zhang [34] was implemented in Python. Table 1 lists the original features extracted from the speech
data and their sub-features, while Table 2 lists the features selected by the forward feature
selection technique. The selected features are used in the subsequent classification process.
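
The greedy search behind a wrapper with forward selection can be sketched as follows. This is a
generic illustration, not the framework of Zhang [34]: the function name, the `score` callback
(which in practice would train and validate the chosen model on a candidate subset), and the toy
scoring function are all assumptions.

```python
def forward_select(n_features, score, max_features=None):
    """Greedy wrapper-style forward selection.

    score(subset) is a caller-supplied function that returns a quality value
    (higher is better) for a tuple of feature indices.
    """
    selected = []
    best_score = float("-inf")
    remaining = set(range(n_features))
    limit = n_features if max_features is None else max_features
    while remaining and len(selected) < limit:
        # Score every subset obtained by adding one remaining feature
        candidate, candidate_score = max(
            ((f, score(tuple(selected + [f]))) for f in sorted(remaining)),
            key=lambda pair: pair[1],
        )
        if candidate_score <= best_score:
            break  # stopping criterion: no candidate improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score


def toy_score(subset):
    """Hypothetical score: three informative features, small penalty per feature."""
    useful = {0: 0.5, 3: 0.3, 5: 0.1}
    return sum(useful.get(f, 0.0) for f in subset) - 0.01 * len(subset)


selected, best = forward_select(8, toy_score)
```

With the toy score, the search adds the three informative features in order of usefulness and stops
as soon as adding any further feature lowers the score.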


Table 1. List of original speech features
Features                     Number of Sub-features
Energy                       9
Intensity                    9
Loudness                     9
Jitter Local                 9
JitterDDP                    9
Shimmer                      9
Harmonics-to-Noise Ratio     9
Zero-Crossing Rate           9
Total Features               72


Table 2. List of selected features
No    Features         Sub-features
1     HNR              Skewness
2     Energy           Skewness
3     Loudness         Minimum
4     Loudness         Maximum Position
5     Loudness         Standard Deviation
6     Jitter Local     Minimum
7     Shimmer Local    Mean
8     Shimmer Local    Standard Deviation
9     ZCR              Minimum
10    ZCR              Skewness
Total Features         10




2.4. Classification methods 

In this study, SVM and RF algorithms were used for classification. According to Pierre-Yves, SVM is a
binary classifier with a two-stage classification process [35]. In the first stage, kernel functions
map the features from a low-dimensional to a high-dimensional space, in which data that are not
linearly separable in the original space become linearly separable. In the second stage, the
maximum-margin hyperplane is constructed to determine the decision boundary between the classes [36].
We implemented the SVM in (8) using the Orange3 toolbox [37], where w_i is the Lagrange multiplier, b
is the bias (threshold), and Z is the kernel function. The default linear kernel was used for the
SVM.


$f(x) = \mathrm{sign}\left( \sum_{i=1}^{n} w_i \cdot Z(x_i, x) + b \right)$ (8)
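
The decision rule in (8) can be sketched in plain Python as follows. This is not the Orange3
implementation: the function names, the toy support vectors, and the convention that each weight
already carries the sign of its class label are assumptions made for illustration.

```python
import math


def linear_kernel(a, b):
    """Inner product of two feature vectors."""
    return sum(p * q for p, q in zip(a, b))


def rbf_kernel(a, b, gamma):
    """Radial basis function kernel exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))


def decide(x, support_vectors, weights, bias, kernel=linear_kernel):
    """Return +1 or -1 from the sign of sum_i w_i * K(x_i, x) + b."""
    total = sum(w * kernel(sv, x) for sv, w in zip(support_vectors, weights)) + bias
    return 1 if total >= 0 else -1


# Toy model with one support vector per class (hypothetical values)
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
weights = [0.5, -0.5]
label_a = decide((2.0, 0.5), support_vectors, weights, 0.0)
label_b = decide((-1.0, -3.0), support_vectors, weights, 0.0)
# The same rule with an RBF kernel, using gamma = 1/72
label_c = decide(
    (2.0, 0.5), support_vectors, weights, 0.0,
    kernel=lambda a, b: rbf_kernel(a, b, gamma=1 / 72),
)
```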


RF is the second algorithm used for classification. According to Probst et al., the RF algorithm
depends on the values of random vectors sampled independently and with the same distribution for all
trees in the forest [38]. It is an ensemble learning method that combines the predictions of several
base learners. Bootstrap aggregation, also known as bagging, is the ensemble learning technique used
in RF. It is formulated in (9), where X = x_1, ..., x_n denotes the training set, Y = y_1, ..., y_n
denotes the responses, and B denotes the number of bagging repetitions. For each b = 1, ..., B, a
sample X_b, Y_b of size n is drawn with replacement from the training set [39], and a regression tree
f_b is trained on X_b, Y_b. After training, the prediction for an unseen sample x' is the average of
the predictions of the individual trees.


$\hat{f} = \frac{1}{B} \sum_{b=1}^{B} f_b(x')$ (9)
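
The bagging average in (9) can be sketched as follows. To keep the example short, each regression
tree is replaced by a hypothetical stand-in learner that always predicts the mean response of its
bootstrap sample; the function names and toy data are likewise assumptions.

```python
import random


def bootstrap_sample(xs, ys, rng):
    """Draw n (x, y) pairs with replacement from the training set."""
    n = len(xs)
    indices = [rng.randrange(n) for _ in range(n)]
    return [xs[i] for i in indices], [ys[i] for i in indices]


def fit_mean_learner(xs, ys):
    """Stand-in for a regression tree: always predicts the sample mean."""
    mean = sum(ys) / len(ys)
    return lambda x_new: mean


def bagged_predict(learners, x_new):
    """Average the predictions of the B individual learners."""
    return sum(f(x_new) for f in learners) / len(learners)


rng = random.Random(1)  # fixed seed for repeatability
X = [[0.1], [0.4], [0.5], [0.9]]
Y = [0.0, 0.0, 1.0, 1.0]
learners = [fit_mean_learner(*bootstrap_sample(X, Y, rng)) for _ in range(100)]
prediction = bagged_predict(learners, [0.6])
```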


3. RESULTS AND DISCUSSION 

The process begins with preparing the data and feature subsets, followed by toxic speech
classification using the SVM and RF algorithms on random samples of the 2,000 data items, split into
training and test data in ratios of 80:20, 70:30, and 60:40. For SVM, the kernel is set to the radial
basis function (RBF), gamma is set to 1/72 (the reciprocal of the number of features), the C
parameter is set to 1, and the nu parameter is set to 0.5. For RF, the number of trees is set to 100,
the maximum tree depth is set to 5, and the seed is set to 1 [40]. Validation was done using the
confusion matrix (CM) method [41]. Before this evaluation stage, cross-validation was carried out
with 3 folds and 10 folds, and random sampling was repeated 10 times with training and test data in
the ratios 60:40, 70:30, and 80:20. The data were processed by both learning algorithms, SVM and RF.
Cross-validation and random sampling were used to obtain class predictions for the 2,000 instances.
The results were then entered into a CM to examine the instances misclassified by the learning
algorithms.
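
The two sampling schemes described above can be sketched with index lists in plain Python. This is a
generic illustration rather than the Orange3 procedure used in the study; the function names are
assumptions, and whether the original sampling was stratified by class is not stated.

```python
import random


def holdout_split(n_samples, train_fraction, seed):
    """Shuffle the indices and cut them into training and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = round(n_samples * train_fraction)
    return indices[:cut], indices[cut:]


def k_fold_indices(n_samples, k, seed):
    """Yield (train, test) index lists so that each sample is tested once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]


# The three hold-out ratios and the two fold counts used in the experiments
splits = {ratio: holdout_split(2000, ratio, seed=1) for ratio in (0.8, 0.7, 0.6)}
folds_3 = list(k_fold_indices(2000, 3, seed=1))
folds_10 = list(k_fold_indices(2000, 10, seed=1))
```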


3.1. Test results with random sample 

The experimental results show that feature selection improves the accuracy and decreases the
computing time of both the SVM and RF algorithms for all training and test data compositions when the
random sample method is used. According to Table 3, the accuracy of the SVM classifier increases for
all data ratios: from 96.5% to 97.3%, from 96.7% to 97.3%, and from 96.8% to 97.5% for the 80:20,
70:30, and 60:40 ratios, respectively. Meanwhile, RF shows the same change for the 80:20 and 70:30
ratios, from 94% to 95.1%, an increase of 1.1 percentage points. Similarly, for the 60:40 ratio, the
accuracy increases by 1.2 percentage points, from 93.5% to 94.7%.


Table 3. Accuracy results of random sample classification
Data Ratio       All Features          Forward Selection
(Train:Test)     SVM       RF          SVM       RF
80:20            96.5%     94%         97.3%     95.1%
70:30            96.7%     94%         97.3%     95.1%
60:40            96.8%     93.5%       97.5%     94.7%
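
The accuracy gains discussed for Table 3 can be recomputed directly from the table values, as in the
short sketch below (the variable and function names are illustrative).

```python
# Accuracy (%) copied from Table 3: ratio -> (all features, forward selection)
svm = {"80:20": (96.5, 97.3), "70:30": (96.7, 97.3), "60:40": (96.8, 97.5)}
rf = {"80:20": (94.0, 95.1), "70:30": (94.0, 95.1), "60:40": (93.5, 94.7)}


def gains(table):
    """Gain in percentage points after feature selection, per data ratio."""
    return {ratio: round(after - before, 1) for ratio, (before, after) in table.items()}


svm_gain = gains(svm)
rf_gain = gains(rf)
```

Here `svm_gain` is 0.8, 0.6, and 0.7 percentage points and `rf_gain` is 1.1, 1.1, and 1.2 percentage
points for the 80:20, 70:30, and 60:40 ratios, respectively.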


According to Table 4, feature selection markedly reduces the training time of the SVM algorithm: for
the 80:20 ratio, it falls from 10.578 seconds to 3.022 seconds; for 70:30, from 6.586 seconds to 2.56
seconds; and for 60:40, from 6.558 seconds to 3.042 seconds. The training time of the RF algorithm
also decreases. For the 80:20 ratio, it falls from 5.257 seconds with all features to 0.688 seconds
after feature selection; for 70:30, from 3.645 seconds to 0.753 seconds; and for 60:40, from 3.885
seconds to 1.169 seconds. Overall, the RF algorithm shows a substantial reduction in training time
relative to the full feature set.


Table 4. Computational time in training process of random sample (in seconds)
Data Ratio       All Features          Forward Selection
(Train:Test)     SVM       RF          SVM       RF
80:20            10.578    5.257       3.022     0.688
70:30            6.586     3.645       2.56      0.753
60:40            6.558     3.885       3.042     1.169
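
To put the training times of Table 4 on a common scale, the sketch below expresses each change as a
relative reduction. The helper name is illustrative; the numbers are copied from the table.

```python
# Training time in seconds from Table 4: (algorithm, ratio) -> (all features, forward selection)
training_time = {
    ("SVM", "80:20"): (10.578, 3.022),
    ("SVM", "70:30"): (6.586, 2.56),
    ("SVM", "60:40"): (6.558, 3.042),
    ("RF", "80:20"): (5.257, 0.688),
    ("RF", "70:30"): (3.645, 0.753),
    ("RF", "60:40"): (3.885, 1.169),
}


def reduction_percent(before, after):
    """Relative reduction in time, as a percentage of the original time."""
    return 100.0 * (before - after) / before


reductions = {key: reduction_percent(*times) for key, times in training_time.items()}
```

For the 80:20 ratio this gives a reduction of about 71% for SVM and about 87% for RF.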


Table 5 compares the testing times. For the SVM algorithm, the test time falls from 1.402 seconds to
0.198 seconds at the 80:20 ratio, from 0.921 seconds to 0.291 seconds at 70:30, and from 1.241
seconds to 0.552 seconds at 60:40. For the RF algorithm, the test time changes from 0.841 to 0.698
seconds, from 0.718 to 0.108 seconds, and from 0.115 to 0.166 seconds for the same ratios. Based on
these results, the test time of both algorithms generally decreases, the only exception being RF at
the 60:40 ratio.


Table 5. Computational time in testing process of random sample (in seconds)
Data Ratio       All Features          Forward Selection
(Train:Test)     SVM       RF          SVM       RF
80:20            1.402     0.841       0.198     0.698
70:30            0.921     0.718       0.291     0.108
60:40            1.241     0.115       0.552     0.166


3.2. Test results with cross-validation 

According to Table 6, the cross-validation results show that the accuracy with forward selection is
higher than with all features. The training and testing times also decreased after selection, because
the computation time decreases as the number of features is reduced. According to Table 7, the time
difference for 3-fold validation is 1.227 seconds for SVM and 1.915 seconds for RF, while for 10-fold
validation it is 6.008 seconds for SVM and 3.511 seconds for RF. Hence, the computation time in
Table 7 decreases after feature selection. Table 8 also showed that the 3-fold validation of SVM has
a difference of 0.24 seconds, while RF has 0.149 seconds. For 10 folds, the difference is 0.459
seconds for SVM and 0.467 seconds for the RF algorithm. This finding indicates that the test time
also decreases after feature selection.


Table 6. Accuracy results of the cross-validation test
k-fold       All Features          Forward Selection
             SVM       RF          SVM       RF
3-fold       97.1%     93.8%       97.7%     94.8%
10-fold      97.1%     94.1%       97.7%     94.9%


Table 7. Computational time in cross-validation test (in seconds)
k-fold       All Features          Forward Selection
             SVM       RF          SVM       RF
3-fold       2.115     2.143       0.888     0.228
10-fold      9.989     4.383       3.981     0.872


As seen in the computation times and accuracy values, the overall results changed noticeably after
feature selection. Forward feature selection generally affects training time, testing time, and
accuracy. However, feature selection does not guarantee an increase in accuracy or an improvement in
computation time. Therefore, more tests are needed.


3.3. Confusion matrix (CM) evaluation with cross-validation 

After testing with cross-validation and random samples, the next step is to build the CM to examine
the number of instances that were incorrectly predicted. Table 8 shows that less than 5% of instances
were incorrectly predicted, meaning that about 1,900 instances were predicted correctly. The
difference before and after feature selection is 1%. However, it is 4.5% for true negative instances
when using the RF algorithm (an increase of 80 correctly predicted instances).

The SVM results in Tables 8 and 9 do not differ significantly, since 3-fold and 10-fold
cross-validation give the same predicted instances. For the RF algorithm, the difference is only
0.1%, which means that increasing the number of folds does not increase the prediction accuracy.
Moreover, in Table 8, Table 9, and Table 10, the labels "0" and "1" denote the toxic and non-toxic
classes, respectively.
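
A percentage confusion matrix of the kind reported in the following tables can be computed as in the
sketch below. It assumes that rows are predicted classes, columns are actual classes, and each column
is scaled to 100%; this orientation is an assumption based on the table columns summing to about
100%. The function name and the toy labels are illustrative.

```python
def confusion_percent(actual, predicted, classes=(0, 1)):
    """Confusion matrix in percent, with each actual-class column scaled to 100%.

    Keys are (predicted_class, actual_class) pairs.
    """
    counts = {(p, a): 0 for p in classes for a in classes}
    for a, p in zip(actual, predicted):
        counts[(p, a)] += 1
    totals = {a: sum(counts[(p, a)] for p in classes) for a in classes}
    return {
        (p, a): 100.0 * counts[(p, a)] / totals[a] if totals[a] else 0.0
        for p in classes
        for a in classes
    }


# Toy labels: 0 = toxic, 1 = non-toxic
actual = [0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 1, 1, 1, 1, 0]
matrix = confusion_percent(actual, predicted)
```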


Table 8. Confusion matrix of 3-fold cross-validation with 2,000 instances
Classification   Class    All Features          Forward Selection
                          0         1           0         1
SVM              0        99.2%     4.6%        99.2%     3.6%
                 1        0.1%      95.4%       0.15%     96.4%
RF               0        99.2%     9.1%        94.1%     4.8%
                 1        0.8%      90.9%       5.9%      95.2%


Table 9. Confusion matrix of 10-fold cross-validation with 2,000 instances
Classification   Class    All Features          Forward Selection
                          0         1           0         1
SVM              0        99.2%     4.6%        99.2%     3.6%
                 1        0.0%      95.4%       0.0%      96.4%
RF               0        98.9%     9.2%        94.2%     4.7%
                 1        1.1%      90.8%       5.8%      95.3%


Table 10. Confusion matrix of 3 data ratios
Classification   Class    All Features          Forward Selection
                          0         1           0         1
80:20 Ratio
SVM              0        99.2%     4.6%        99.2%     3.6%
                 1        0.0%      95.4%       0.0%      96.4%
RF               0        98.9%     9.2%        94.2%     4.7%
                 1        1.1%      90.8%       5.8%      95.3%
70:30 Ratio
SVM              0        99.2%     5.0%        99.2%     4.2%
                 1        0.0%      95.0%       0.0%      95.8%
RF               0        99.5%     9.0%        94.6%     5.0%
                 1        0.50%     91.0%       5.4%      95.0%
60:40 Ratio
SVM              0        99.2%     5.0%        99.2%     4.0%
                 1        0.0%      95.0%       0.0%      96.0%
RF               0        99.3%     9.10%       95.4%     5.70%
                 1        0.70%     90.9%       4.6%      94.3%


3.4. Confusion matrix (CM) evaluation with random sample 

Table 10 compares the prediction results for the three data ratios. At the 80:20 ratio, the
prediction of true negative instances increased by 1.2% with the SVM algorithm, while the RF
algorithm shows a much larger increase of 4.2%, from an initial value of 91% to 95.2%. At the 70:30
ratio, true negatives increased by 0.8% for SVM and by 4% for RF. At the 60:40 ratio, they increased
by 1% for SVM and by 3.4% for RF. Based on these results, the SVM prediction rate did not decrease at
any data ratio but remained at 99.2% for class 0. Therefore, the SVM algorithm tends to improve
accuracy, but by less than RF: its gain is only 0.8-1.4%, compared with a rise of about 4% for RF.
Moreover, Table 11 compares our proposed work with previous work. Our proposed feature selection work
improves the accuracy to 99.5%, a level that is commonly targeted by deep learning researchers.


Table 11. Comparison of work
Author                   Year    Deep Learning Model                           Accuracy
Sharma et al. [42]       2018    Naive Bayes, SVM, and RF                      73.42% (using Naive Bayes), 71.71% (using SVM), and 76.42% (using RF)
Oriola and Kotzé [43]    2019    SVM                                           95.74%
Juuti et al. [44]        2020    Generative Pre-trained Transformer 2 (GPT-2)  97.3%
d’Sa et al. [45]         2020    BiLSTM-CNN                                    97%
Malik et al. [7]         2021    BiLSTM and CNN                                96.2% (using BiLSTM) and 95.42% (using CNN)
Our Proposed Work        2021    SVM and RF                                    99.2% (using SVM) and 99.5% (using RF)


4. CONCLUSION 

Toxic speech remains a continual threat, and any piece of information can be harmful when toxic
content is not prevented. A classification of toxic speech using the SVM algorithm, evaluated with
cross-validation and random samples, is therefore considered appropriate for this problem. Compared
with the original feature set, our approach showed a positive trend in accuracy, training time, and
testing time. A total of 10 features were selected from the 72 initial ones. The wrapper method with
the forward feature selection technique improved the computation time by up to about 90% and
increased the accuracy by up to 5% for the RF algorithm and by 0.4%-1.2% for the SVM algorithm. Thus,
forward selection is suitable for selecting features for toxic speech classification. The results
also imply that a larger number of features does not guarantee predictions with a high degree of
accuracy. Future work may employ other speech features, such as mel-frequency cepstral coefficients
(MFCC), voice onset time, signal-to-noise ratio (SNR), and speech rate.